Add per-instance cost cap to swe-bench runner by juanmichelini · Pull Request #741 · OpenHands/benchmarks

juanmichelini · 2026-06-08T18:02:50Z

Why

Triggered by review of OpenHands/openhands-index-results#1167 — the Gemini-3.5-Flash swe-bench-verified run spent $1,912 across 500 instances (mean $3.82), but 22 instances cost more than $10 each and accounted for ~20% of the total spend. The worst single instance cost $44.24.

Root cause for the 22 expensive instances

All 22 had the same fingerprint compared to typical runs:

| bucket | n | cache_read / prompt_tokens | mean events | mean prompt_tokens |
|---|---|---|---|---|
| HIGH (>$10) | 22 | 10.3% | 342 | 11.5M |
| MID ($1–10) | 453 | 27.5% | 195 | 2.6M |
| LOW (≤$1) | 24 | 45.5% | 122 | 0.8M |

The worst case (django__django-16116, $44.24) fired the LLMSummarizingCondenser 4 times during its 1069 events (at events 367, 552, 737, 922). Each condensation rewrites the prompt prefix and therefore invalidates the provider's prompt cache, so the bulk of subsequent prompt tokens are billed at the full uncached price (~10× the cached rate). Combined with reasoning_effort=high (which adds reasoning tokens to every uncached call) and 300+ iterations, this multiplied out to:

uncached prompt: (32.0M − 3.6M) × $1.50/M = $42.68
cache reads: 3.6M × $0.15/M = $0.53
output (completion + reasoning): 196K × $9.00/M = $1.76
total ≈ $44.97 ✓ (within about 2% of the observed $44.24)
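
The breakdown above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not code from the PR: the function name `estimate_cost` is hypothetical, and the inputs are the rounded token counts and per-million prices quoted above. With those rounded inputs the total comes out at roughly $44.90; the small gap from the $44.97 quoted is presumably a rounding effect in the displayed token counts (an assumption).

```python
# Per-million-token prices as quoted in the breakdown above.
UNCACHED_PRICE_PER_M = 1.50  # $ per 1M uncached prompt tokens
CACHED_PRICE_PER_M = 0.15    # $ per 1M cache-read prompt tokens
OUTPUT_PRICE_PER_M = 9.00    # $ per 1M output (completion + reasoning) tokens


def estimate_cost(prompt_m: float, cache_read_m: float, output_m: float) -> dict:
    """Split an instance's cost into uncached, cached and output parts.

    All token counts are in millions. Hypothetical helper for illustration.
    """
    uncached = (prompt_m - cache_read_m) * UNCACHED_PRICE_PER_M
    cached = cache_read_m * CACHED_PRICE_PER_M
    output = output_m * OUTPUT_PRICE_PER_M
    return {
        "uncached": uncached,
        "cached": cached,
        "output": output,
        "total": uncached + cached + output,
    }


# Rounded figures for django__django-16116: 32.0M prompt tokens,
# 3.6M of them cache reads, 196K output tokens.
breakdown = estimate_cost(prompt_m=32.0, cache_read_m=3.6, output_m=0.196)
for name, value in breakdown.items():
    print(f"{name}: ${value:.2f}")
```

Uncached prompt tokens account for well over 90% of the estimated total, which is why the cache invalidation dominates the bill.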

20 of the 22 high-cost instances were resolved, so the agent was making progress — it just took too many iterations and burned too much money doing so.

What this PR does

Adds a small, opt-in defence-in-depth measure: a --max-cost-per-instance CLI flag (also exposed as EvalMetadata.max_cost_per_instance, default None = disabled). When set, a callback pauses the conversation as soon as its accumulated cost exceeds the cap, mirroring the existing behaviour of max_iteration_per_run. The patch produced up to that point is still collected and submitted.

Scope

Wired into the swe-bench runner (benchmarks/swebench/run_infer.py) only, since that's where the regression surfaced.
The new max_cost_per_instance field lives on the shared EvalMetadata, so plumbing it into the other benchmark runners is a one-line change per runner in a follow-up.

What this does not fix

The underlying condenser cache-invalidation issue. Fixing that properly would need SDK-level changes (e.g. a condenser that keeps a stable cache prefix, or enforcement of Metrics.max_budget_per_task inside the run loop). Both are larger changes worth doing separately.

Files

New: benchmarks/utils/cost_cap.py — CostCapCallback class with deferred binding (the callback needs a reference to the conversation, which can only be obtained after construction). Defensive error handling so a misbehaving metrics or pause() call can never take an instance down.
New: tests/test_cost_cap.py — 9 unit tests using a fake conversation: rejects non-positive caps, no-op below cap, pauses at/above cap, idempotent once triggered, safe before binding, swallows metrics/pause failures.
Modified: benchmarks/utils/models.py — add max_cost_per_instance: float | None (gt=0).
Modified: benchmarks/utils/args_parser.py — add --max-cost-per-instance.
Modified: benchmarks/swebench/run_infer.py — construct the callback before Conversation, bind it after.
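
For readers without the diff open, the deferred-binding pattern described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of benchmarks/utils/cost_cap.py: the attribute names, the log messages, and the `FakeConversation` stand-in are hypothetical, and the sketch assumes the bound object exposes `get_combined_metrics()` (returning something with `accumulated_cost`) and `pause()` directly. The at-or-above comparison follows the test description ("pauses at/above cap").

```python
import logging
from types import SimpleNamespace
from typing import Any, Optional

logger = logging.getLogger(__name__)


class CostCapCallback:
    """Pause a conversation once its accumulated cost reaches a cap (sketch)."""

    def __init__(self, max_cost: float) -> None:
        if max_cost <= 0:
            raise ValueError("max_cost must be positive")
        self.max_cost = max_cost
        self.triggered = False
        self._conversation: Optional[Any] = None

    def bind(self, conversation: Any) -> None:
        # Deferred binding: the conversation only exists after it has been
        # constructed with this callback already in its callback list.
        self._conversation = conversation

    def __call__(self, event: Any) -> None:
        # No-op once triggered (idempotent) and before bind() is called.
        if self.triggered or self._conversation is None:
            return
        try:
            cost = self._conversation.get_combined_metrics().accumulated_cost
        except Exception:
            logger.warning("cost cap: could not read metrics", exc_info=True)
            return
        if cost < self.max_cost:
            return
        # Mark as triggered first so a failing pause() is not retried.
        self.triggered = True
        try:
            self._conversation.pause()
        except Exception:
            logger.warning("cost cap: pause() failed", exc_info=True)


class FakeConversation:
    """Hypothetical stand-in for a real conversation object."""

    def __init__(self) -> None:
        self.cost = 0.0
        self.pause_calls = 0

    def get_combined_metrics(self) -> SimpleNamespace:
        return SimpleNamespace(accumulated_cost=self.cost)

    def pause(self) -> None:
        self.pause_calls += 1


callback = CostCapCallback(max_cost=10.0)
callback(None)            # safe before binding: nothing happens
conv = FakeConversation()
callback.bind(conv)       # bind after the conversation exists
conv.cost = 3.82
callback(None)            # below the cap: no pause
conv.cost = 10.5
callback(None)            # at or above the cap: pause once
callback(None)            # already triggered: no second pause
```

Setting `triggered` before calling `pause()` is a design choice of this sketch: it keeps a failing `pause()` from being retried on every later event.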

Test plan

pytest tests/test_cost_cap.py -v → 9 passed.
Smoke-tested that EvalMetadata(... max_cost_per_instance=0) is rejected by Pydantic and that the default value is None.
Smoke-tested the argparse plumbing: --max-cost-per-instance 7.5 parses to 7.5, absence parses to None.
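
The argparse smoke test can be reproduced in isolation. The sketch below assumes the flag is declared as a plain optional float; the actual definition in benchmarks/utils/args_parser.py is not shown in this description, so the help text and parser setup here are hypothetical.

```python
import argparse

# Stand-alone parser with only the new flag; the real parser has many more.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--max-cost-per-instance",
    type=float,
    default=None,  # None means the cap is disabled
    help="Pause an instance once its accumulated cost reaches this many dollars.",
)

with_cap = parser.parse_args(["--max-cost-per-instance", "7.5"])
without_cap = parser.parse_args([])
print(with_cap.max_cost_per_instance, without_cap.max_cost_per_instance)
```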

Usage

# Default: no cap, behaviour unchanged.
python -m benchmarks.swebench.run_infer ...

# Cap per-instance cost at $10 (would have saved ~$240 on the
# Gemini-3.5-Flash run referenced above, with no impact on the
# 478 instances that finished under $10).
python -m benchmarks.swebench.run_infer ... --max-cost-per-instance 10

This PR was created by an AI agent (OpenHands) on behalf of @juanmichelini, in response to a review comment on OpenHands/openhands-index-results#1167.


Some evaluations have a small minority of instances that consume disproportionately large amounts of money. For example, the Gemini-3.5-Flash swe-bench-verified run on PR OpenHands/openhands-index-results#1167 spent $1912 total across 500 instances ($3.82 mean), but 22 instances cost >$10 each and accounted for ~20% of the total spend, with a worst case of $44.24 for a single instance.

Root cause for those 22 instances: they triggered the LLMSummarizingCondenser multiple times (4x for the worst case). Each condensation rewrites the prompt prefix and therefore invalidates the provider's prompt cache. Their cache-read ratio averaged 10% versus 27% for typical instances and 45% for cheap ones, so the bulk of their tokens were billed at the full uncached price. Combined with reasoning_effort=high (which adds reasoning tokens to every uncached call) and 300+ iterations, this multiplied out to ~$44 on the worst instance.

This adds a defence-in-depth measure: a `--max-cost-per-instance` flag (also exposed as `EvalMetadata.max_cost_per_instance`, default None = disabled). When set, a small callback pauses the conversation once the per-instance accumulated_cost exceeds the cap, mirroring the existing behaviour of `max_iteration_per_run`. The patch produced up to that point is still collected and submitted.

This does not fix the underlying condenser cache-invalidation issue (which would need SDK-level changes), but it does cap the blast radius for any single instance across all models.

Wired into the swe-bench runner first since that is where the regression surfaced; can be plumbed into the other benchmark runners in a follow-up.

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-06-10T19:49:07Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation.

all-hands-bot

Code Review Summary

🟡 Good taste — Well-motivated feature solving a real cost outlier problem. The implementation is clean and defensive.

What's Good

Real problem, real data: The PR addresses a genuine issue (22 instances costing >$10 each, ~20% of total spend). The root cause analysis in the PR description is solid.
Defence-in-depth approach: Correctly positioned as a last-resort cap rather than trying to fix the underlying cache/condenser issues at the SDK level.
Defensive error handling: Both pause() and get_combined_metrics() failures are caught and logged, preventing cascading failures.
Idempotent trigger: Once triggered, the callback skips subsequent calls — good for performance and correctness.
Comprehensive tests: All error paths covered (metrics failure, pause failure, unbound callback, idempotency).

Minor Suggestions

benchmarks/swebench/run_infer.py, line 304. Type annotation: callbacks: list should be callbacks: list[Callback] for clarity, matching the Conversation's callback type expectation.
benchmarks/utils/cost_cap.py, line 73. Type hint: bind(self, conversation: object) uses object; consider using a Protocol or BaseConversation type for better IDE support and documentation.
benchmarks/utils/models.py, lines 97-98. Documentation overlap: the field description repeats information already in the module docstring. Consider trimming to avoid drift.

No Blocking Issues

The implementation is sound, well-tested, and solves the stated problem without introducing unnecessary complexity.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW

The change is additive-only (new optional parameter with no default enforcement), wraps external failures gracefully, and includes comprehensive test coverage. No breaking changes to existing functionality.

VERDICT:
✅ Worth merging: Core logic is sound, defensive coding is exemplary, and the feature solves a real operational problem.

KEY INSIGHT:
The two-phase binding pattern (pass callback to Conversation constructor, then bind conversation reference afterward) is the correct approach to ensure the callback participates in the composed chain from the first event — well thought out.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation.

juanmichelini requested a review from all-hands-bot June 10, 2026 19:48

juanmichelini marked this pull request as ready for review June 10, 2026 19:48

all-hands-bot approved these changes Jun 10, 2026


juanmichelini enabled auto-merge (squash) June 11, 2026 19:44

juanmichelini wants to merge 1 commit into main from fix/per-instance-cost-cap
